feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel#2792

Merged
bkryu merged 3 commits into flashinfer-ai:main from elvischenv:elvischenv/support-rope-fusion-token-padding
Apr 9, 2026

Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Mar 16, 2026

📌 Description

vLLM uses seqlen=0 padding tokens to run a full CUDA graph: https://github.com/vllm-project/vllm/blob/95c0f928cdeeaa21c4906e73cee6a156e1b3b995/vllm/v1/worker/gpu/model_runner.py#L652-L654

This PR updates the following functions:
  • get_batch_indices_positions_kernel: initialize batch_indices/positions to -1/0 so that padding tokens can be recognized
  • rope_quantize_fp8_append_paged_kv_cache: skip those padding tokens
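As a mental model, the marking scheme above can be sketched in plain Python. This is an illustrative stand-in for the Triton kernel, not FlashInfer's actual implementation; the helper name and argument order mirror get_batch_indices_positions but are simplified:

```python
def batch_indices_positions(append_indptr, seq_lens, nnz):
    # Every slot starts as a padding token: batch_indices = -1, positions = 0.
    batch_indices = [-1] * nnz
    positions = [0] * nnz
    # Overwrite only the slots that belong to a real request.
    for req, (start, end) in enumerate(zip(append_indptr, append_indptr[1:])):
        for i in range(start, end):
            batch_indices[i] = req
            # Position of this appended token within the request's full sequence.
            positions[i] = seq_lens[req] - (end - i)
    return batch_indices, positions

# Decode batch: 2 real requests (1 new token each) padded out to nnz=4 tokens,
# as a CUDA-graph-captured batch would be.
bi, pos = batch_indices_positions([0, 1, 2], [5, 9], nnz=4)
# bi  == [0, 1, -1, -1]  -> slots 2 and 3 are padding and will be skipped
# pos == [4, 8,  0,  0]
```

The fused kernel can then branch on batch_indices[i] >= 0 and leave the KV cache untouched for the padded tail.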

Testing:
pytest -v -s tests/attention/test_rope.py::test_rope_quantize_fp8_append_paged_kv_cache_padding

======= 4 passed in 1.66s =======

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or using my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Paged KV cache append now reliably skips padding entries, preventing unintended overwrites during attention quantization and positional encoding processing.
    • Deterministic initialization of internal index/position buffers to avoid uninitialized memory affecting padding behavior.
  • Tests

    • Added parameterized regression test confirming padding entries are ignored and prefilled KV cache data remains exactly unchanged.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the KV cache update mechanism to gracefully handle padding tokens, which is crucial for enabling full CUDA graph functionality in systems like vLLM. By explicitly marking and skipping padding tokens during KV cache writes, the changes prevent data corruption and ensure the integrity of the cache while maintaining the performance benefits of fixed-size batches.

Highlights

  • Padding Token Support: Introduced support for padding tokens within the rope+quant+kv cache update fusion kernel to align with vLLM's full CUDA graph requirements.
  • Kernel Updates: The get_batch_indices_positions_kernel now explicitly marks padding tokens with batch_indices = -1 and positions = 0. The rope_quantize_fp8_append_paged_kv_cache kernel has been updated to skip processing these marked padding tokens.
  • Test Coverage: A new test case, test_rope_quantize_fp8_append_paged_kv_cache_padding, was added to ensure that padding tokens do not corrupt the KV cache, simulating a decode batch with padded requests.


Changelog
  • flashinfer/page.py
    • Updated get_batch_indices_positions to pass the nnz argument to the Triton kernel.
  • flashinfer/triton/page.py
    • Modified get_batch_indices_positions_kernel to accept nnz and to fill padding entries with batch_indices=-1 and positions=0.
  • include/flashinfer/pos_enc.cuh
    • Added a conditional check in RopeQuantizeAppendPagedKVCacheKernel to return early if batch_indices is less than 0, effectively skipping padding tokens.
  • tests/attention/test_rope.py
    • Added test_rope_quantize_fp8_append_paged_kv_cache_padding to validate that padding tokens do not corrupt the KV cache.
Activity
  • The author has indicated that pre-commit checks have been installed and run, and tests have been added or updated as needed, with all tests passing.

@coderabbitai
Contributor

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

📝 Walkthrough

Walkthrough

Deterministically initialize batch_indices and positions buffers and add a kernel-level guard so tokens with batch_indices = -1 are skipped, preventing RoPE, quantization, and paged KV cache append work for padding tokens.

Changes

  • Tensor Initialization (flashinfer/page.py): get_batch_indices_positions now initializes batch_indices to -1 (instead of leaving it uninitialized) when allocating, fills a provided batch_indices with -1, and initializes positions to zeros when not provided.
  • Kernel-Level Padding Guards (include/flashinfer/pos_enc.cuh): the RopeQuantizeAppendPagedKVCacheKernel token-processing block is wrapped in if (batch_indices[idx] >= 0), so page/entry computation, RoPE cos/sin loads, quantization, and the paged KV cache append paths are skipped for padding indices.
  • Test Coverage (tests/attention/test_rope.py): added test_rope_quantize_fp8_append_paged_kv_cache_padding, which constructs paged-KV metadata with padded requests, asserts the batch_indices padding markers, invokes the kernel, and verifies padded cache entries remain byte-identical to prefilled snapshots across attention types and layouts.

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host (CPU)
    participant Kernel as RopeQuantizeAppendPagedKVCacheKernel (GPU)
    participant KV as Paged KV Cache
    Host->>Host: prepare inputs (batch_indices, positions)
    Host->>Kernel: launch kernel with inputs
    Kernel->>Kernel: compute global idx
    alt batch_indices[idx] >= 0
        Kernel->>Kernel: compute page/entry\napply RoPE, quantize
        Kernel->>KV: append/store K/V/Q into paged cache
    else batch_indices[idx] < 0
        Kernel->>Kernel: skip RoPE/quantize/cache ops
    end
    Kernel-->>Host: kernel completes

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • yzh119
  • kahyunnam
  • bkryu
  • nvmbreughe
  • jiahanc

Poem

🐰
I mark the padded hops with -1 bright,
So kernels skip the places out of sight,
RoPE stays neat, the cache keeps its lore,
No stray bytes tumble — calm on the floor,
A tiny hop for correctness tonight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅: the title accurately summarizes the main change: adding support for padding tokens with seqlen=0 in the rope+quant+kv cache update fusion kernel.
  • Description check ✅: the PR description covers the motivation (vLLM usage), the code changes (two updated functions), test results (4 passed), and completed checklist items. All critical sections from the template are addressed.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for padding tokens in the rope+quant+kv cache update fused kernel, which is useful for cudagraphs. The approach involves modifying get_batch_indices_positions_kernel to mark padding tokens and updating RopeQuantizeAppendPagedKVCacheKernel to skip them. A new test case is added to validate this padding logic. While the implementation changes seem correct, I've identified issues in the new test case where token positions are calculated incorrectly. This could cause the test to pass while not properly verifying the intended behavior, potentially masking bugs. I've provided suggestions to correct the test logic.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/attention/test_rope.py (1)

1390-1590: Add enable_pdl coverage to this new padding regression test.

Lines 1392-1589 only exercise the default path. Please parameterize enable_pdl and pass it into the fused call so padding behavior is validated under the programmatic dependent launch mode too.

Proposed test update
 @pytest.mark.parametrize("kv_layout", ["NHD", "HND"])
 @pytest.mark.parametrize("page_size", [16])
+@pytest.mark.parametrize("enable_pdl", [True, False])
 def test_rope_quantize_fp8_append_paged_kv_cache_padding(
@@
     kv_layout,
     page_size,
+    enable_pdl,
 ):
@@
     flashinfer.rope.rope_quantize_fp8_append_paged_kv_cache(
@@
         quant_scale_kv=1.0,
         is_neox=False,
+        enable_pdl=enable_pdl,
     )
Inline comments:
In include/flashinfer/pos_enc.cuh:
  • Around lines 862-865: replace the early return on the batch_indices check so that all threads reach the PDL epilogue. Remove the `if (batch_indices[idx] < 0) return;` and instead wrap the work body that follows (the block currently between lines 867-1030) in `if (batch_indices[idx] >= 0) { ... }`. Keep the final epilogue (including the griddepcontrol.launch_dependents instruction) outside that guard so it executes unconditionally for every thread in the block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6761f8e1-64cf-4451-8fe3-a2914f272b95

📥 Commits

Reviewing files that changed from the base of the PR and between b418bc3 and 476de9c.

📒 Files selected for processing (4)
  • flashinfer/page.py
  • flashinfer/triton/page.py
  • include/flashinfer/pos_enc.cuh
  • tests/attention/test_rope.py

Comment on lines +862 to +863
// skip padding tokens with batch_indices < 0
if (batch_indices[idx] >= 0) {
Contributor Author


The main change is just this line; everything below it is only an indentation change.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
include/flashinfer/pos_enc.cuh (1)

862-868: Make padding-sentinel check type-safe across PagedKVIdType instantiations.

Line 863 uses batch_indices[idx] >= 0, which is only safe when PagedKVIdType is signed. If it is ever unsigned, the padding sentinel -1 becomes the maximum value and this branch incorrectly passes, leading to invalid indptr indexing.

🔧 Proposed fix
-    // skip padding tokens with batch_indices < 0
-    if (batch_indices[idx] >= 0) {
+    constexpr PagedKVIdType kPaddingSentinel = static_cast<PagedKVIdType>(-1);
+    const PagedKVIdType batch_idx = batch_indices[idx];
+    if (batch_idx != kPaddingSentinel) {
       // Compute page location for this token
       uint32_t page_iter, entry_idx;
       paged_kv_like.page_size.divmod(
-          paged_kv_like.indptr[batch_indices[idx]] * paged_kv_like.page_size + positions[idx],
+          paged_kv_like.indptr[batch_idx] * paged_kv_like.page_size + positions[idx],
           page_iter, entry_idx);
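The signed/unsigned hazard described here is easy to reproduce outside CUDA; the Python sketch below emulates a 32-bit unsigned reinterpretation of the -1 sentinel (illustrative only, not FlashInfer code):

```python
SENTINEL = -1

def as_uint32(x):
    # Reinterpret the low 32 bits as an unsigned value, like casting to uint32_t.
    return x & 0xFFFFFFFF

# Signed index type: `>= 0` correctly rejects the padding sentinel.
assert not (SENTINEL >= 0)

# Unsigned index type: -1 wraps around to 4294967295, so a `>= 0` guard
# wrongly accepts the padding token and would index indptr out of bounds.
assert as_uint32(SENTINEL) >= 0
assert as_uint32(SENTINEL) == 4294967295

# An equality test against the cast sentinel works for either signedness,
# which is why the suggestion prefers `!= static_cast<PagedKVIdType>(-1)`.
assert as_uint32(SENTINEL) == as_uint32(-1)
```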

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a25b44d8-52fc-4514-8eeb-44210f1c0877

📥 Commits

Reviewing files that changed from the base of the PR and between 476de9c and 54c5fff.

📒 Files selected for processing (1)
  • include/flashinfer/pos_enc.cuh

@elvischenv elvischenv force-pushed the elvischenv/support-rope-fusion-token-padding branch 2 times, most recently from f0f6c80 to 63197ac, on March 16, 2026 at 07:29
@elvischenv elvischenv changed the title from "feat: Support padding token for rope+quant+kv cache update fusion kernel" to "feat: Support 0 seqlen padding tokens for rope+quant+kv cache update fusion kernel" on Mar 16, 2026
@elvischenv elvischenv changed the title from "feat: Support 0 seqlen padding tokens for rope+quant+kv cache update fusion kernel" to "feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel" on Mar 16, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
include/flashinfer/pos_enc.cuh (1)

862-863: Consider adding a compile-time assertion to document the signed type requirement for the padding check.

The code assumes PagedKVIdType is signed to make the check batch_indices[idx] >= 0 work correctly for detecting padding tokens (marked with -1). While the type is hardcoded to int32_t in the JIT binding at csrc/rope.cu (lines 574, 605) and in Python via torch.int(), a static assertion would document this requirement and guard against accidental misuse:

static_assert(std::is_signed<PagedKVIdType>::value,
              "PagedKVIdType must be signed for padding token detection (batch_indices < 0)");

This defensive check aligns with the kernel's assumptions and improves long-term maintainability without adding runtime overhead.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: de20e129-16a7-4945-80ef-553ab0f8df70

📥 Commits

Reviewing files that changed from the base of the PR and between 54c5fff and 63197ac.

📒 Files selected for processing (3)
  • flashinfer/page.py
  • include/flashinfer/pos_enc.cuh
  • tests/attention/test_rope.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • flashinfer/page.py
  • tests/attention/test_rope.py

@elvischenv
Contributor Author

Hi @yzh119, could you help review this? We need this fix for integrating this kernel into vLLM. Thanks!

@elvischenv
Contributor Author

cc @kahyunnam for visibility.

@bkryu bkryu added the run-ci label Mar 20, 2026
@bkryu
Collaborator

bkryu commented Mar 20, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been created, and the CI pipeline #46584451 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #46584451: 14/20 passed

@nvpohanh
Contributor

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been created, and the CI pipeline #46776615 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #46776615: 12/20 passed

@elvischenv elvischenv force-pushed the elvischenv/support-rope-fusion-token-padding branch from 63197ac to 832ac30 on March 24, 2026 at 12:28
@nvpohanh
Contributor

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been updated with latest changes, and the CI pipeline #47263242 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47263242: 13/20 passed

@nvpohanh
Contributor

@elvischenv could you rebase again?

@elvischenv elvischenv force-pushed the elvischenv/support-rope-fusion-token-padding branch from 09fad56 to c182b4c on March 31, 2026 at 01:49
@nvpohanh
Contributor

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been updated with latest changes, and the CI pipeline #47308264 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47308264: 7/20 passed

@nvpohanh
Contributor

@bkryu Could you review this or assign someone to review this? Thanks!

@nvpohanh
Contributor

nvpohanh commented Apr 7, 2026

@bkryu could you review this? Thanks

@bkryu
Collaborator

bkryu commented Apr 7, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been updated with latest changes, and the CI pipeline #47955669 is currently running. I'll report back once the pipeline job completes.

Collaborator


Hi @elvischenv, the changes generally look correct and the CI does seem to pass.

However, have you tried measuring the performance implications? I'm asking because torch.full and torch.zeros tend to call memset, which has higher overhead than torch.empty. I'm wondering whether there will be a noticeable performance difference from it.

Contributor


@elvischenv is this called per decoding step or per attention layer? If it is per decoding step, I am less worried about the additional memsets. But if it is per layer, the overhead may be a noticeable performance cost.

Contributor Author


get_batch_indices_positions is a helper function that prepares the arguments needed by rope_quantize_fp8_append_paged_kv_cache, so it should only be called once per decoding step. The whole iteration can then reuse the same batch_indices and positions, which won't produce noticeable overhead.
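The amortization argument can be sketched as follows; the two functions are hypothetical stand-ins with simplified signatures, while the real ones live in flashinfer and take tensors and cache arguments:

```python
def get_batch_indices_positions(nnz):
    # One sentinel-fill per decoding step: this is the memset-like cost
    # discussed above.
    return [-1] * nnz, [0] * nnz

def rope_quantize_fp8_append_paged_kv_cache(layer, batch_indices, positions):
    # Stand-in for the fused per-layer RoPE + FP8-quant + KV-append kernel.
    return layer

num_layers, nnz = 32, 8
# Prepared once per decoding step...
batch_indices, positions = get_batch_indices_positions(nnz)
calls = 0
for layer in range(num_layers):
    # ...and reused by every layer's fused kernel call in that step.
    rope_quantize_fp8_append_paged_kv_cache(layer, batch_indices, positions)
    calls += 1
# One initialization is amortized over `num_layers` kernel launches.
```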

Contributor


@bkryu Once per decoding step should be okay? Do you agree?

Collaborator


I agree that it should be fine.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #47955669: 11/20 passed

@nvpohanh
Contributor

nvpohanh commented Apr 9, 2026

@bkryu Are the failures known ones or caused by this PR?

Copy link
Copy Markdown
Collaborator

@bkryu bkryu left a comment


CI failures are unrelated. LGTM!


@bkryu bkryu merged commit b705b67 into flashinfer-ai:main Apr 9, 2026
40 of 60 checks passed
@elvischenv elvischenv deleted the elvischenv/support-rope-fusion-token-padding branch April 12, 2026 17:36